80 research outputs found

    npInv: accurate detection and genotyping of inversions using long read sub-alignment

    BACKGROUND: Detection of genomic inversions remains challenging. Many existing methods primarily target inversions with a non-repetitive breakpoint, leaving inverted repeat (IR) mediated non-allelic homologous recombination (NAHR) inversions largely unexplored. RESULT: We present npInv, a novel tool specifically for detecting and genotyping NAHR inversions using sub-alignment of long read sequencing data. We benchmark npInv against other tools in both simulation and real data. We use npInv to generate a whole-genome inversion map for NA12878 consisting of 30 NAHR inversions (of which 15 are novel), including all previously known NAHR mediated inversions in NA12878 with flanking IR less than 7 kb. Our genotyping accuracy on this dataset was 94%. We used PCR to confirm the presence of two of these novel inversions. We show that there is a near linear relationship between the length of flanking IR and the minimum inversion size, without inverted repeats. CONCLUSION: The application of npInv shows high accuracy in both simulation and real data. The results give deeper insight into understanding inversion

    Fregene: Simulation of realistic sequence-level data in populations and ascertained samples

    Background: FREGENE simulates sequence-level data over large genomic regions in large populations. Because, unlike coalescent simulators, it works forwards through time, it allows complex scenarios of selection, demography, and recombination to be modelled simultaneously. Detailed tracking of sites under selection is implemented in FREGENE and provides the opportunity to test theoretical predictions and gain new insights into mechanisms of selection. We describe here the main functionalities of both FREGENE and SAMPLE, a companion program that can replicate association study datasets. Results: We report detailed analyses of six large simulated datasets that we have made publicly available. Three demographic scenarios are modelled: one panmictic, one substructured with migration, and one complex scenario that mimics the principal features of genetic variation in major worldwide human populations. For each scenario there is one neutral simulation, and one with a complex pattern of selection. Conclusion: FREGENE and the simulated datasets will be valuable for assessing the validity of models for selection, demography and population genetic parameters, as well as the efficacy of association studies. Its principal advantages are modelling flexibility and computational efficiency. It is open source and object-oriented. As such, it can be customised and the range of models extended

    Evaluation of Host Serum Protein Biomarkers of Tuberculosis in sub-Saharan Africa.

    Accurate and affordable point-of-care diagnostics for tuberculosis (TB) are needed. Host serum protein signatures have been derived for use in primary care settings; however, validation of these in secondary care settings is lacking. We evaluated serum protein biomarkers discovered in primary care cohorts from Africa reapplied to patients from secondary care. In this nested case-control study, concentrations of 22 proteins were quantified in sera from 292 patients from Malawi and South Africa who presented predominantly to secondary care. Recruitment was based upon intention of local clinicians to test for TB. The case definition for TB was culture positivity for Mycobacterium tuberculosis; and for other diseases (OD) a confirmed alternative diagnosis. Equal numbers of TB and OD patients were selected. Within each group, there were equal numbers with and without HIV and from each site. Patients were split into training and test sets for biosignature discovery. A nine-protein signature to distinguish TB from OD was discovered comprising fibrinogen, alpha-2-macroglobulin, CRP, MMP-9, transthyretin, complement factor H, IFN-gamma, IP-10, and TNF-alpha. This signature had an area under the receiver operating characteristic curve in the training set of 90% (95% CI 86-95%), and, after adjusting the cut-off for increased sensitivity, a sensitivity and specificity in the test set of 92% (95% CI 80-98%) and 71% (95% CI 56-84%), respectively. The best single biomarker was complement factor H [area under the receiver operating characteristic curve 70% (95% CI 64-76%)]. Biosignatures consisting of host serum proteins may function as point-of-care screening tests for TB in African hospitals. Complement factor H is identified as a new biomarker for such signatures

    Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies

    Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation

    Natural and Orthogonal Interaction framework for modeling gene-environment interactions with application to lung cancer

    Objectives: We aimed at extending the Natural and Orthogonal Interaction (NOIA) framework, developed for modeling gene-gene interactions in the analysis of quantitative traits, to allow for reduced genetic models, dichotomous traits, and gene-environment interactions. We evaluate the performance of the NOIA statistical models using simulated data and lung cancer data. Methods: The NOIA statistical models are developed for additive, dominant, and recessive genetic models as well as for a binary environmental exposure. Using the Kronecker product rule, a NOIA statistical model is built to model gene-environment interactions. By treating the genotypic values as the logarithm of odds, the NOIA statistical models are extended to the analysis of case-control data. Results: Our simulations showed that power for testing associations while allowing for interaction using the NOIA statistical model is much higher than using functional models for most of the scenarios we simulated. When applied to lung cancer data, much smaller p values were obtained using the NOIA statistical model for either the main effects or the SNP-smoking interactions for some of the SNPs tested. Conclusion: The NOIA statistical models are usually more powerful than the functional models in detecting main effects and interaction effects for both quantitative traits and binary traits. Copyright (C) 2012 S. Karger AG, Base

    Diagnostic Test Accuracy of a 2-Transcript Host RNA Signature for Discriminating Bacterial vs Viral Infection in Febrile Children.

    IMPORTANCE: Because clinical features do not reliably distinguish bacterial from viral infection, many children worldwide receive unnecessary antibiotic treatment, while bacterial infection is missed in others. OBJECTIVE: To identify a blood RNA expression signature that distinguishes bacterial from viral infection in febrile children. DESIGN, SETTING, AND PARTICIPANTS: Febrile children presenting to participating hospitals in the United Kingdom, Spain, the Netherlands, and the United States between 2009-2013 were prospectively recruited, comprising a discovery group and validation group. Each group was classified after microbiological investigation as having definite bacterial infection, definite viral infection, or indeterminate infection. RNA expression signatures distinguishing definite bacterial from viral infection were identified in the discovery group and diagnostic performance assessed in the validation group. Additional validation was undertaken in separate studies of children with meningococcal disease (n = 24) and inflammatory diseases (n = 48) and on published gene expression datasets. EXPOSURES: A 2-transcript RNA expression signature distinguishing bacterial infection from viral infection was evaluated against clinical and microbiological diagnosis. MAIN OUTCOMES AND MEASURES: Definite bacterial and viral infection was confirmed by culture or molecular detection of the pathogens. Performance of the RNA signature was evaluated in the definite bacterial and viral group and in the indeterminate infection group. RESULTS: The discovery group of 240 children (median age, 19 months; 62% male) included 52 with definite bacterial infection, of whom 36 (69%) required intensive care, and 92 with definite viral infection, of whom 32 (35%) required intensive care. Ninety-six children had indeterminate infection. Analysis of RNA expression data identified a 38-transcript signature distinguishing bacterial from viral infection. 
    A smaller (2-transcript) signature (FAM89A and IFI44L) was identified by removing highly correlated transcripts. When this 2-transcript signature was implemented as a disease risk score in the validation group (130 children, with 23 definite bacterial, 28 definite viral, and 79 indeterminate infections; median age, 17 months; 57% male), all 23 patients with microbiologically confirmed definite bacterial infection were classified as bacterial (sensitivity, 100% [95% CI, 100%-100%]) and 27 of 28 patients with definite viral infection were classified as viral (specificity, 96.4% [95% CI, 89.3%-100%]). When applied to additional validation datasets from patients with meningococcal and inflammatory diseases, bacterial infection was identified with a sensitivity of 91.7% (95% CI, 79.2%-100%) and 90.0% (95% CI, 70.0%-100%), respectively, and with specificity of 96.0% (95% CI, 88.0%-100%) and 95.8% (95% CI, 89.6%-100%). Of the children in the indeterminate groups, 46.3% (63/136) were classified as having bacterial infection, although 94.9% (129/136) received antibiotic treatment. CONCLUSIONS AND RELEVANCE: This study provides preliminary data regarding test accuracy of a 2-transcript host RNA signature discriminating bacterial from viral infection in febrile children. Further studies are needed in diverse groups of patients to assess accuracy and clinical utility of this test in different clinical settings
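
The sensitivity and specificity quoted for the validation group follow directly from the reported counts (23 of 23 definite bacterial cases classified as bacterial, 27 of 28 definite viral cases classified as viral). A minimal sketch of that arithmetic is shown here; the function names are illustrative and are not taken from the study's software.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true cases that the test classifies as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-cases that the test classifies as negative."""
    return true_neg / (true_neg + false_pos)


# Counts from the abstract: 23/23 bacterial called bacterial,
# 27/28 viral called viral (so 1 viral case was called bacterial).
sens = sensitivity(true_pos=23, false_neg=0)
spec = specificity(true_neg=27, false_pos=1)

print(f"sensitivity = {sens:.1%}")
print(f"specificity = {spec:.1%}")
```

The confidence intervals in the abstract are not reproduced here, because the abstract does not say which interval method was used.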

    Genome-wide association study of primary tooth eruption identifies pleiotropic loci associated with height and craniofacial distances

    Twin and family studies indicate that the timing of primary tooth eruption is highly heritable, with estimates typically exceeding 80%. To identify variants involved in primary tooth eruption we performed a population based genome-wide association study of ‘age at first tooth’ and ‘number of teeth’ using 5998 and 6609 individuals respectively from the Avon Longitudinal Study of Parents and Children (ALSPAC) and 5403 individuals from the 1966 Northern Finland Birth Cohort (NFBC1966). We tested 2,446,724 SNPs imputed in both studies. Analyses were controlled for the effect of gestational age, sex and age of measurement. Results from the two studies were combined using fixed effects inverse variance meta-analysis. We identified a total of fifteen independent loci, with ten loci reaching genome-wide significance (p < 5×10⁻⁸) for ‘age at first tooth’ and eleven loci for ‘number of teeth’. Together these associations explain 6.06% of the variation in ‘age at first tooth’ and 4.76% of the variation in ‘number of teeth’. The identified loci included eight previously unidentified loci, some containing genes known to play a role in tooth and other developmental pathways, including a SNP in the protein-coding region of BMP4 (rs17563, P = 9.080×10⁻¹⁷). Three of these loci, containing the genes HMGA2, AJUBA and ADK, also showed evidence of association with craniofacial distances, particularly those indexing facial width. Our results suggest that the genome-wide association approach is a powerful strategy for detecting variants involved in tooth eruption, and potentially craniofacial growth and more generally organ development
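
The abstract above combines two cohorts by fixed effects inverse variance meta-analysis. As a rough illustration of the standard form of that method, each study's effect estimate is weighted by the reciprocal of its squared standard error. The per-study effect sizes and standard errors below are made-up placeholder numbers, not values from ALSPAC or NFBC1966, and the function name is hypothetical.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effects pooled estimate and standard error.

    Each study is weighted by w_i = 1 / se_i**2.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Hypothetical per-SNP estimates from two cohorts (illustrative only).
beta, se = inverse_variance_meta(betas=[0.10, 0.14], ses=[0.02, 0.04])
z_score = beta / se

print(f"pooled beta = {beta:.3f}, pooled SE = {se:.4f}, z = {z_score:.2f}")
```

The more precise study (smaller standard error) dominates the pooled estimate, which is why the result sits closer to 0.10 than to 0.14.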

    MultiPhen: Joint Model of Multiple Phenotypes Can Increase Discovery in GWAS

    The genome-wide association study (GWAS) approach has discovered hundreds of genetic variants associated with diseases and quantitative traits. However, despite clinical overlap and statistical correlation between many phenotypes, GWAS are generally performed one-phenotype-at-a-time. Here we compare the performance of modelling multiple phenotypes jointly with that of the standard univariate approach. We introduce a new method and software, MultiPhen, that models multiple phenotypes simultaneously in a fast and interpretable way. By performing ordinal regression, MultiPhen tests the linear combination of phenotypes most associated with the genotypes at each SNP, and thus potentially captures effects hidden to single phenotype GWAS. We demonstrate via simulation that this approach provides a dramatic increase in power in many scenarios. There is a boost in power for variants that affect multiple phenotypes and for those that affect only one phenotype. While other multivariate methods have similar power gains, we describe several benefits of MultiPhen over these. In particular, we demonstrate that other multivariate methods that assume the genotypes are normally distributed, such as canonical correlation analysis (CCA) and MANOVA, can have highly inflated type-1 error rates when testing case-control or non-normal continuous phenotypes, while MultiPhen produces no such inflation. To test the performance of MultiPhen on real data we applied it to lipid traits in the Northern Finland Birth Cohort 1966 (NFBC1966). In these data MultiPhen discovers 21% more independent SNPs with known associations than the standard univariate GWAS approach, while applying MultiPhen in addition to the standard approach provides 37% increased discovery. 
    The most associated linear combinations of the lipids estimated by MultiPhen at the leading SNPs accurately reflect the Friedewald Formula, suggesting that MultiPhen could be used to refine the definition of existing phenotypes or uncover novel heritable phenotypes
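
For readers unfamiliar with the Friedewald Formula mentioned above: in its standard form it estimates LDL cholesterol as a linear combination of total cholesterol, HDL cholesterol, and triglycerides (all in mg/dL), and it is conventionally not applied when triglycerides reach 400 mg/dL. The abstract does not spell the formula out, so the sketch below is the textbook version rather than anything specific to MultiPhen; the function name and the input values are illustrative.

```python
def friedewald_ldl_mg_dl(total_chol: float, hdl: float, triglycerides: float) -> float:
    """Estimate LDL cholesterol (mg/dL) as TC - HDL - TG/5."""
    if triglycerides >= 400:
        # The estimate is conventionally considered unreliable at high TG.
        raise ValueError("Friedewald estimate is not used for TG >= 400 mg/dL")
    return total_chol - hdl - triglycerides / 5.0


# Illustrative lipid panel (not data from NFBC1966).
ldl = friedewald_ldl_mg_dl(total_chol=200.0, hdl=50.0, triglycerides=150.0)
print(f"estimated LDL = {ldl:.0f} mg/dL")
```

The coefficients (1, -1, -1/5) are the kind of linear combination of lipid traits that the abstract says MultiPhen recovers at the leading SNPs.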

    Identification of novel locus associated with coronary artery aneurysms and validation of loci for susceptibility to Kawasaki disease

    Kawasaki disease (KD) is a paediatric vasculitis associated with coronary artery aneurysms (CAA). Genetic variants influencing susceptibility to KD have been previously identified, but no risk alleles have been validated that influence CAA formation. We conducted a genome-wide association study (GWAS) for CAA in KD patients of European descent with 200 cases and 276 controls. A second GWAS for susceptibility pooled KD cases with healthy paediatric controls from vaccine trials in the UK (n = 1609). Logistic regression mixed models were used for both GWASs. The susceptibility GWAS was meta-analysed with 400 KD cases and 6101 controls from a previous European GWAS; these results were further meta-analysed with Japanese GWASs at two putative loci. The CAA GWAS identified an intergenic region of chromosome 20q13 with multiple SNVs showing genome-wide significance. The risk allele of the most associated SNV (rs6017006) was present in 13% of cases and 4% of controls; in East Asian 1000 Genomes data, the allele was absent or rare. The susceptibility GWAS, meta-analysed with previously published European data, identified two previously associated loci (ITPKC and FCGR2A). Further meta-analysis with Japanese GWAS summary data from the CASP3 and FAM167A genomic regions validated these loci in Europeans, showing consistent effects of the top SNVs in both populations. We identified a novel locus for CAA in KD patients of European descent. The results suggest that different genes determine susceptibility to KD and development of CAA, and that future work should focus on the function of the intergenic region on chromosome 20q13

    Diagnosis of Kawasaki Disease Using a Minimal Whole-Blood Gene Expression Signature.

    Importance: To date, there is no diagnostic test for Kawasaki disease (KD). Diagnosis is based on clinical features shared with other febrile conditions, frequently resulting in delayed or missed treatment and an increased risk of coronary artery aneurysms. Objective: To identify a whole-blood gene expression signature that distinguishes children with KD in the first week of illness from other febrile conditions. Design, Setting, and Participants: The case-control study comprised a discovery group that included a training and test set and a validation group of children with KD or comparator febrile illness. The setting was pediatric centers in the United Kingdom, Spain, the Netherlands, and the United States. The training and test discovery group comprised 404 children with infectious and inflammatory conditions (78 KD, 84 other inflammatory diseases, and 242 bacterial or viral infections) and 55 healthy controls. The independent validation group comprised 102 patients with KD, including 72 in the first 7 days of illness, and 130 febrile controls. The study dates were March 1, 2009, to November 14, 2013, and data analysis took place from January 1, 2015, to December 31, 2017. Main Outcomes and Measures: Whole-blood gene expression was evaluated using microarrays, and minimal transcript sets distinguishing KD were identified using a novel variable selection method (parallel regularized regression model search). The ability of transcript signatures (implemented as disease risk scores) to discriminate KD cases from controls was assessed by area under the curve (AUC), sensitivity, and specificity at the optimal cut point according to the Youden index. Results: Among 404 patients in the discovery set, there were 78 with KD (median age, 27 months; 55.1% male) and 326 febrile controls (median age, 37 months; 56.4% male). 
    Among 202 patients in the validation set, there were 72 with KD (median age, 34 months; 62.5% male) and 130 febrile controls (median age, 17 months; 56.9% male). A 13-transcript signature identified in the discovery training set distinguished KD from other infectious and inflammatory conditions in the discovery test set, with AUC of 96.2% (95% CI, 92.5%-99.9%), sensitivity of 81.7% (95% CI, 60.0%-94.8%), and specificity of 92.1% (95% CI, 84.0%-97.0%). In the validation set, the signature distinguished KD from febrile controls, with AUC of 94.6% (95% CI, 91.3%-98.0%), sensitivity of 85.9% (95% CI, 76.8%-92.6%), and specificity of 89.1% (95% CI, 83.0%-93.7%). The signature was applied to clinically defined categories of definite, highly probable, and possible KD, resulting in AUCs of 98.1% (95% CI, 94.5%-100%), 96.3% (95% CI, 93.3%-99.4%), and 70.0% (95% CI, 53.4%-86.6%), respectively, mirroring certainty of clinical diagnosis. Conclusions and Relevance: In this study, a 13-transcript blood gene expression signature distinguished KD from other febrile conditions. Diagnostic accuracy increased with certainty of clinical diagnosis. A test incorporating the 13-transcript disease risk score may enable earlier diagnosis and treatment of KD and reduce inappropriate treatment in those with other diagnoses
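
The Kawasaki disease abstract reports sensitivity and specificity at the optimal cut point according to the Youden index, J = sensitivity + specificity - 1. A small sketch of how such a cut point can be chosen from risk scores follows; the scores and labels are invented for illustration, the helper name is hypothetical, and this is not the study's own pipeline.

```python
def youden_cut_point(scores, labels):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    A sample is called positive when its score is >= threshold.
    Ties in J keep the lowest threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = true_pos / positives + true_neg / negatives - 1
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Hypothetical disease risk scores; label 1 = case, 0 = control.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, j = youden_cut_point(scores, labels)
print(f"cut point = {threshold}, Youden J = {j:.2f}")
```

Applied to the validation figures quoted above, the same definition gives J = 0.859 + 0.891 - 1 = 0.750.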